Abstract

Dynamic texture and scene classification are two fundamental problems in
understanding natural video content. Extracting robust and effective features is
a crucial step towards solving these problems. However, the existing approaches
are sensitive to varying illumination, viewpoint changes, or even camera motion,
and/or lack spatial information. Inspired by the success of deep structures in
image classification, we attempt to leverage a deep structure to extract
features for dynamic texture and scene classification. To tackle the challenges
in training a deep structure, we propose to transfer some prior knowledge from
the image domain to the video domain. Specifically, we propose to apply a
well-trained Convolutional Neural Network (ConvNet) as a mid-level feature
extractor to extract features from each frame, and then form a representation of
a video by concatenating the first and the second order statistics over the
mid-level features. We term this two-level feature extraction scheme a
Transferred ConvNet Feature (TCoF). Moreover, we explore two different
implementations of the TCoF scheme, i.e., the spatial TCoF and the temporal
TCoF, in which the mean-removed frames and the difference between two adjacent
frames are used as the inputs of the ConvNet, respectively. We systematically
evaluate the proposed spatial TCoF and temporal TCoF schemes on three benchmark
data sets, DynTex, YUPENN, and Maryland, and demonstrate that the proposed
approach yields superior performance.

Keywords: Dynamic Texture Classification, Dynamic Scene Classification, 
Transferred ConvNet Feature, Convolutional Neural Network 


1. Introduction 


Dynamic texture and dynamic scene classification are two fundamental problems
in understanding natural video content and have gained considerable research
attention. Roughly, dynamic textures can be described as visual processes that
consist of a group of particles with random motions; dynamic scenes can be
considered as places where events occur. In Fig. 1, we show some sample images
from the dynamic scene data set YUPENN [11]. The ability to automatically
categorize dynamic textures or scenes is useful, since it can be used to
recognize the presence of events, surfaces, actions, and phenomena in a video
surveillance system.

However, automatically categorizing dynamic textures or dynamic scenes is a
challenging problem because of the wide range of naturally occurring variations
in a short video, e.g., illumination variations, viewpoint changes, or even
significant camera motions. It is commonly accepted that constructing a robust
and effective representation of a video sequence is a crucial step towards
solving these problems. In the past decade, a large number of methods for video
representation have been proposed, e.g., Linear Dynamic System (LDS) based
methods, the GIST based method [25], Local Binary Pattern (LBP) based methods,
and wavelet based methods. Unfortunately, the existing approaches are sensitive
to varying illumination, viewpoint changes, or even camera motion, and/or lack
spatial information.

Recently there has been a surge of research interest in developing deep
structures. Inspired by the success of deep structures in image classification,
in this paper we attempt to leverage a deep structure to extract features for
dynamic texture and scene classification. However, learning a deep structure
needs a huge amount of training data and is computationally expensive.
Unfortunately, as in other video classification tasks, dynamic texture and scene
classification suffers from the small size of the available training data. As a
result, the lack of training data is an obstacle to deploying a deep structure
for video classification tasks.

Noting that there is a lot of work on learning deep structures for classifying
images, in this paper we attempt to transfer knowledge from the image domain to
compensate for the shortage of training data when training a deep structure to
represent dynamic textures and scenes. Concretely, we propose to apply a
well-trained Convolutional Neural Network (ConvNet) as a mid-level feature
extractor to extract features from each frame in a video, and then form a
representation of a video by concatenating the first and the second order
statistics over the mid-level features. We term this two-level feature
extraction scheme a Transferred ConvNet Feature (TCoF).

Figure 1: Sample images from the dynamic scene data set YUPENN. Each row
corresponds to a category.

Our aim in this paper is to explore a robust and effective way to capture
the spatial and temporal information in dynamic textures and scenes.
Specifically, our contributions are as follows:


• We propose a two-level feature extraction scheme to represent dynamic
textures and scenes, which applies a trained Convolutional Neural Network
(ConvNet) as a feature extractor to extract mid-level features from each
frame in a video and then computes the first and the second order statistics
over the mid-level features. To the best of our knowledge, this is the
first investigation of using a deep network with transferred knowledge to
represent dynamic textures or scenes.

• We investigate the effects of the spatial and temporal mid-level features
on three benchmark data sets. Experimental results show that: a) the
spatial feature is more effective for categorizing dynamic textures and
dynamic scenes, and b) when the video is stabilized, the temporal feature
can provide some complementary information.

The remainder of the paper is organized as follows. We review the related
studies in Section 2 and present our proposals in Section 3. We evaluate the
proposed spatial and temporal TCoF schemes in Section 4, and finally we conclude
this paper with a discussion in Section 5.


2. Related Work 


In the literature, there are numerous approaches for dynamic texture and
scene classification. While closely related, dynamic texture classification and
dynamic scene classification have so far usually been considered separately as
two different problems.

The research history of dynamic texture classification is much longer than
that of dynamic scene classification. The latter, as far as we know, started
when two dynamic scene data sets, the Maryland “in the wild” Dynamic Scene data
set [34] and the York stabilized Dynamic Scene data set [27], were released.
Although there might not be a clear distinction in nature, the slight difference
between the two is that the frames in a dynamic texture video consist of images
with richer texture, whereas the frames in a dynamic scene video show a natural
scene evolving over time. In addition, regarding the data sets, dynamic texture
videos are usually stabilized, whereas the dynamic scene data sets might include
some significant camera motions.

The critical challenges in categorizing dynamic textures or scenes come
from the wide range of variations around the naturally occurring phenomena.
To overcome this difficulty, numerous methods for video representation have
been proposed. Among them, Linear Dynamic System (LDS) based methods, the GIST
based method [25], Local Binary Pattern (LBP) based methods, and wavelet based
methods are the most widely used. LDS is a statistical generative model which
captures the spatial appearance and dynamics in a video. While LDS yields
promising performance on viewpoint-invariant sequences, it performs poorly on
viewpoint-variant sequences. Besides, it is also sensitive to illumination
variations. GIST represents the spatial envelope of an image (or a frame in a
video) holistically by Gabor filters. However, GIST suffers from scale and
rotation variations. Among LBP based methods, Local Binary Pattern on Three
Orthogonal Planes (LBP-TOP) [40] is the most widely used. LBP-TOP describes a
video by computing local binary patterns from three orthogonal planes
(xy, xt, and yt) only. After LBP-TOP, several variants have been proposed,
e.g., Local Ternary Pattern on Three Orthogonal Planes (LTP-TOP), Weber Local
Descriptor on Three Orthogonal Planes (WLD-TOP), and Local Phase Quantization
on Three Orthogonal Planes (LPQ-TOP) [30]. While LBP-TOP and its variants are
effective at capturing spatial and temporal information and robust to
illumination variations, they suffer from camera motions. Recently, wavelet
based methods have also been proposed, e.g., Spatiotemporal Oriented Energy
(SOE), Wavelet Domain Multifractal Analysis (WDMA), and Bag-of-Spacetime-Energy
(BoSE) [16]. Combined with the Improved Fisher Vector (IFV) encoding strategy,
BoSE leads to the state-of-the-art performance on dynamic scene classification.
However, the computational cost of BoSE is high due to slow feature extraction
and quantization.

The aforementioned methods can be roughly divided into two categories:
the global approaches and the local approaches. The global approaches extract
features from each frame in a video sequence by treating each frame as a whole,
e.g., LDS [14] and GIST [25]. While the global approaches describe the spatial
layout information well, they are sensitive to illumination variations,
viewpoint changes, or scale and rotation variations. The local approaches
construct a statistic (e.g., a histogram) over a set of features extracted from
local patches in each frame or local volumes in a video sequence; examples
include LBP-TOP [40], LPQ-TOP [30], BoSE [16], and Bag of LDS. While the local
approaches are robust to transformations (e.g., rotation, illumination), they
lack the spatial layout information which is important for representing a
dynamic texture or dynamic scene.

In this paper, we attempt to leverage a deep structure with transferred
knowledge from the image domain to construct a robust and effective
representation for dynamic textures and scenes. Specifically, we propose to use
a pre-trained ConvNet, which has been trained on the large-scale image data set
ImageNet [21], as transferred (prior) knowledge, and then fine-tune the ConvNet
with the frames in the videos of the training set. Equipped with a trained
ConvNet, we extract mid-level features from each frame in a video and represent
a video by the concatenation of the first and the second order statistics over
the mid-level features.

Figure 2: The typical structure of a ConvNet.

Compared to previous studies, our approach possesses the following advantages:


• Our approach represents a video with a two-level strategy. The deep
structure used at the frame level is easier to train, or even training-free,
since we can adopt prior knowledge from the image domain.

• The extracted frame-level features are robust to translations, small scale 
variations, partial rotations, and illumination variations. 

• Our approach represents a video sequence by a concatenation of the first 
and the second order statistics of the frame-level features. This process is 
fast and effective. 


In the next section, we will present the framework and two different
implementations of our proposal.


3. Our Proposal: Transferred ConvNet Feature (TCoF) 

Our TCoF scheme consists of three stages: 

• Constructing a ConvNet with transferred knowledge from image domain; 

• Extracting the mid-level feature with the ConvNet from each frame in a 
video; 

• Forming the video-level representation by concatenating the calculated 
first and the second order statistics over the frame-level features. 

3.1. Convolutional Neural Network with Transferred Knowledge for Extracting 
Frame-Level Features 

Figure 3: The architecture of the ConvNet used in our TCoF scheme.

Note that there is a lot of work on learning deep structures for classifying
images. Among them, Convolutional Neural Networks (ConvNets) have been
demonstrated to be extremely successful in computer vision.
We show a typical structure of a ConvNet in Fig. 2. The ConvNet consists of
two types of layers: convolutional layers and fully connected layers. The
convolutional part, as shown in the left panel of Fig. 2, consists of three
components: convolution, Local Contrast Normalization (LCN), and pooling. Among
the three components, the convolution is compulsory, while LCN and pooling are
optional. The convolution component captures complex image structures. LCN
achieves invariance to illumination variations. The pooling component not only
yields partial invariance to scale variations and translations, but also
reduces the complexity for the downstream layers. Owing to parameter sharing,
which is motivated by the local receptive field in the biological vision system,
the number of free parameters in a convolutional layer is significantly
reduced. The fully connected layer, as shown in the right panel of Fig. 2,
is the same as in a multi-layer perceptron neural network.
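
The partial translation invariance attributed to pooling can be illustrated with a minimal NumPy sketch. It is not the ConvNet used in the paper: it assumes plain non-overlapping 2 x 2 max pooling, and the helper name `max_pool2d` and the toy images are hypothetical.

```python
import numpy as np

def max_pool2d(x, k=2):
    """Non-overlapping k x k max pooling; both sides of x must be multiples of k."""
    h, w = x.shape
    # Split rows and columns into (block index, offset inside block), then
    # take the maximum over the two offset axes.
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

# A single bright pixel, and the same pixel shifted by one position
# while staying inside the same 2 x 2 pooling window.
a = np.zeros((4, 4))
a[0, 0] = 1.0
b = np.zeros((4, 4))
b[1, 1] = 1.0

pooled_a = max_pool2d(a)
pooled_b = max_pool2d(b)
same_after_pooling = np.array_equal(pooled_a, pooled_b)
```

The invariance is only partial: a shift that moves the pixel across a window boundary (for example to position (2, 2)) does change the pooled output.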

In our TCoF framework, we use a ConvNet with five convolutional layers
and two fully connected layers, as shown in Fig. 3, to extract the mid-level
feature from each frame in a video. This is the same architecture as the
ConvNet introduced by Krizhevsky et al. [21], which won the large-scale
ImageNet contest, except that we remove the final fully connected layer of the
ConvNet introduced in [21].


As mentioned previously, training a deep network like that in Fig. 3 well needs
a huge amount of training data and is computationally expensive. In our case,
for dynamic textures or scenes, the training data is limited. Instead of
training a deep network from scratch, which is quite time-consuming, we
propose to use the pre-trained ConvNet [21] as the initialization, and fine-tune
the ConvNet with the frames in videos from the training data if necessary. By
using a good initialization, we effectively transfer miscellaneous prior
knowledge from the image domain (e.g., the ImageNet data set) to the dynamic
texture and scene classification tasks.


3.2. Construct Video-Level Representation 

Given a video sequence containing N frames, the ConvNet yields N ConvNet
features. Note that as the input to the ConvNet in TCoF, we use each frame of
a video with an averaged image subtracted.

Denote by $X = \{x_1, x_2, \ldots, x_N\}$ the set of ConvNet features, where
$x_i \in \mathbb{R}^d$ is the ConvNet feature extracted from the i-th frame. We
extract the first and the second order statistics of the feature set X.
The first-order statistics of X is the mean vector, which is defined as follows:

$$u = \frac{1}{N} \sum_{i=1}^{N} x_i, \tag{1}$$

where u captures the average behavior of the N ConvNet features, which reflects
the average characteristics of the video sequence.

The second-order statistics is the covariance matrix, which is defined as
follows:

$$S = \frac{1}{N} \sum_{i=1}^{N} (x_i - u)(x_i - u)^\top, \tag{2}$$

where S describes the variation of the N ConvNet features from the mean
vector u and the correlations among different dimensions. The dimension of the
covariance feature is $\frac{d(d+1)}{2}$. When d is large (e.g., d = 4096), the
dimension of the covariance feature is high. Instead, we propose to extract only
the diagonal entries of S as the second-order feature, that is,

$$v = \mathrm{diag}(S), \tag{3}$$

where diag(·) means to extract the diagonal entries of a matrix as a vector. The
vector v is d-dimensional and captures the variations along each dimension of
the ConvNet features.
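
As a quick check of the sizes involved, the following arithmetic (computed here from the value d = 4096 quoted above, not taken from the paper) compares the number of distinct entries of the full covariance matrix with the length of its diagonal:

```latex
\frac{d(d+1)}{2}\bigg|_{d=4096} = \frac{4096 \times 4097}{2} = 8{,}390{,}656,
\qquad
\dim\big(\mathrm{diag}(S)\big) = d = 4096 .
```

Keeping only the diagonal therefore shrinks the second-order feature by a factor of roughly two thousand.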

Having calculated the first and the second order statistics, we form the
video-level representation, TCoF, by concatenating u and v, i.e.,

$$\mathrm{TCoF} = [u^\top, v^\top]^\top, \tag{4}$$

where the dimension of a TCoF representation is 2d.
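
The second stage of the scheme can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `tcof` is hypothetical, the random array stands in for real ConvNet features, and the variance is normalized by N on the assumption that the covariance is the plain sample average over frames.

```python
import numpy as np

def tcof(features):
    """Build a 2d-dimensional video descriptor from an (N, d) feature array."""
    x = np.asarray(features, dtype=np.float64)
    u = x.mean(axis=0)                 # first-order statistic: mean vector
    v = ((x - u) ** 2).mean(axis=0)    # diagonal of the covariance matrix
    return np.concatenate([u, v])      # concatenation of u and v

# Toy stand-in for frame-level features: N = 50 frames, d = 8 dimensions.
rng = np.random.default_rng(0)
frame_features = rng.standard_normal((50, 8))
video_repr = tcof(frame_features)
```

Computing the per-dimension variance directly avoids ever forming the full d x d covariance matrix, which is why this step costs only O(Nd) time and memory.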

For clarity, we illustrate the flowchart of constructing a TCoF representation
for a video sequence in Fig. 4.

Remarks 1. Our proposed TCoF belongs to the global approaches. Since the spatial
layout information is captured well, we term the TCoF scheme described
above the spatial TCoF. Our proposed TCoF is robust to translations, small
scale variations, partial rotations, and illumination variations owing to the
ConvNet component. In addition, extracting a TCoF vector is extremely fast,
since the ConvNet adopts a stride strategy and the second step in TCoF only
calculates the two statistics.


3.3. Modeling Temporal Information 

While it is well accepted that dynamic information can enrich our
understanding of textures or scenes, modeling the dynamic information is
difficult. Unlike the motion of rigid objects, dynamic textures and scenes
usually involve non-rigid objects, and thus the optical flow information
appears relatively random.

Figure 4: An illustration of our TCoF scheme.


In this paper, we propose to use the difference between two nearby frames to
capture the random-like micro-motion patterns. Specifically, we take the
difference between the $(i+\tau)$-th and $i$-th frames as the input of the
ConvNet component in the TCoF scheme, where $\tau \in \{1, \ldots, N-1\}$ is an
integer which corresponds to the temporal resolution at which the random-like
micro-motion patterns are captured. In practice, we set $\tau$ to a small
integer, e.g., 1, 2, or 3.

Given a video sequence containing N frames, the ConvNet produces $N - \tau$
temporal frame-level features. Then we extract the first and the second order
statistics of the temporal ConvNet features to form a temporal TCoF for the
input video, in the same way as for the spatial TCoF in Section 3.2.

Remarks 2. The temporal TCoF differs from the spatial TCoF in the input of
the ConvNet. In the spatial TCoF, we take each frame of a video, with a
precalculated average image subtracted, as input; whereas in the temporal TCoF
we take the difference between two frames a short time apart, and there is no
need to subtract an average image.
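
The two kinds of ConvNet input can be contrasted with a small NumPy sketch. The function names, the frame lag argument `tau`, and the toy video are hypothetical; in particular, the mean image here is computed from the toy video itself, which is an assumption made only for illustration (the text uses a precalculated average image).

```python
import numpy as np

def spatial_inputs(frames, mean_image):
    """Mean-removed frames: one ConvNet input per frame (N inputs)."""
    return frames - mean_image

def temporal_inputs(frames, tau=1):
    """Differences between frame i + tau and frame i (N - tau inputs)."""
    if not 1 <= tau < len(frames):
        raise ValueError("tau must satisfy 1 <= tau <= N - 1")
    return frames[tau:] - frames[:-tau]

# Toy video with N = 5 frames of size 2 x 2; frame k holds the values 4k..4k+3.
frames = np.arange(5 * 2 * 2, dtype=np.float64).reshape(5, 2, 2)
mean_image = frames.mean(axis=0)          # stand-in for the precalculated average
s_in = spatial_inputs(frames, mean_image)
t_in = temporal_inputs(frames, tau=2)
```

With N = 5 and a lag of 2, the temporal branch yields 3 inputs, matching the count of N minus the lag given above.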

Remarks 3. In our proposed TCoF, we treat the extracted N ConvNet features
as a set and ignore the sequential information among features. The rationale of
this simplification comes from the nature of dynamic textures and dynamic
scenes. Since dynamic textures are visual processes of a group of particles
with random motions, and dynamic scenes are places where natural events are
occurring, the sequential information in these processes is relatively random
and thus less critical. Experimental results in Section 4.3 support this
point.

4. Experiments 

In this section, we introduce the benchmark data sets, the baseline methods,
and the implementation details, and then present the experimental evaluations
of our approach.

Figure 5: Sample frames from the DynTex data set. The “Alpha”, “Beta”, and
“Gamma” panels show the sample frames from each category in the data set.


4.1. Data Sets Description

DynTex [§3] is a widely used dynamic texture data set, containing 656 
videos with each sequence recorded in PAL format. The sequences in DynTex
are divided into three data subsets, “Alpha”, “Beta” and “Gamma”: a) the
“Alpha” data subset contains 60 sequences which are equally divided into 3
categories: “sea”, “grass” and “trees”; b) the “Beta” data subset consists of
162 sequences which are grouped into 10 categories: “sea”, “grass”, “trees”,
“flags”, “calm water”, “fountains”, “smoke”, “escalator”, “traffic”, and
“rotation”; c) the “Gamma” data subset is composed of 264 sequences which are
grouped into 10 categories: “flowers”, “sea”, “trees without foliage”, “dense
foliage”, “escalator”, “calm water”, “flags”, “grass”, “traffic” and
“fountains”. Compared to the “Alpha” and “Beta” data subsets, the “Gamma” data
subset contains more complex image variations, e.g., in scale and orientation.
Sample frames from the three data subsets are shown in Fig. 5.

YUPENN [11] is a “stabilized” dynamic scene data set. This data set was
introduced to emphasize scene-specific temporal information. YUPENN consists
of fourteen dynamic scene categories with 30 color videos in each category.
The sequences in YUPENN have significant variations in frame rate, scene
appearance, scale, illumination, and camera viewpoint. Some sample frames are
shown in Fig. 6.

Figure 6: Samples from the dynamic scene data set YUPENN. Each image
corresponds to a category of video sequence.


Maryland [34] was the first dynamic scene data set to be introduced. It
consists of 13 categories with 10 videos per category. The data set has large
variations in illumination, frame rate, viewpoint, and scale. Besides, there are
variations in resolution and camera dynamics. Some sample frames are shown
in Fig. 7.


4.2. Baselines and Implementation Details

Baselines. We compare our proposed TCoF approach with the following
state-of-the-art methods.¹

• GIST [25]: A holistic representation of the spatial envelope, which is widely
used in 2D static scene classification.

• Histogram of Flow (HOF) [23]: HOF is a well-known descriptor in
action recognition.

• Local Binary Pattern on Three Orthogonal Planes (LBP-TOP) [40]: LBP-TOP is
widely used in dynamic texture, dynamic facial expression,
and dynamic facial micro-expression analysis.

• Chaotic Dynamic Features (Chaos) [34].

• Slow Feature Analysis (SFA) [37].

• Synchrony Autoencoder (SAE) [20].

• Synchrony K-means (SK-means) [20].

• Complementary Spacetime Orientation (CSO) [15]: In CSO, the complementary
spatial and temporal features are fused in a random forest
framework.

• Bag of Spacetime Energy (BoSE) [16].

¹ For LBP-TOP, we report the results with our own implementation; for the other
methods, we cite the results from their papers.

Figure 7: Sample frames from the Maryland scenes data set.

Implementation Details. In both the spatial TCoF (s-TCoF) and the
temporal TCoF (t-TCoF), we resize each frame to 224 × 224 and normalize
s-TCoF and t-TCoF with the L2-norm, respectively. For the combination of
the spatial and temporal TCoF, we concatenate the normalized s-TCoF and t-TCoF
and denote the result as st-TCoF. We do not use any data augmentation method.
For our t-TCoF, we use $\tau = 3$. We use the CAFFE toolbox [18] to extract the
proposed TCoFs. Note that we take the weights of every layer except the final
fully connected layer in the well-trained ConvNet as the initialization. The
whole ConvNet could then be fine-tuned with the training data. While the
fine-tuning stage is easier than training a ConvNet from scratch with random
initialization, we observed that the improvement from the extra fine-tuning was
minor. Thus we use the ConvNet without further fine-tuning to extract the
mid-level features.² For LBP-TOP, we use the best performing setting,
LBP-TOP$_{8,8,8,1,1,1}$, and the $\chi^2$ kernel. To compare fairly with
previous methods, we test our approach and other baselines with both the
nearest neighbor (NN) classifier and the SVM classifier separately. For SVM, we
use a linear SVM with the Libsvm toolbox, in which the tradeoff parameter C
is fixed to 40 in all our experiments. Following the standard protocol, we use
Leave-One-Out (LOO) cross-validation.

² Note that we use Leave-One-Out cross-validation to evaluate the performance,
so the training data change from trial to trial. If we chose to fine-tune the
ConvNet, we would have to fine-tune it for each trial. Since the improvements
were minor, we report the results without fine-tuning to keep all experimental
results repeatable.
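
The normalization, concatenation, and leave-one-out evaluation described above can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: all function names are hypothetical, the features are synthetic, and the Euclidean distance for the nearest neighbor classifier is an assumption, since the text does not name the metric.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def st_tcof(s_feats, t_feats):
    """Concatenate separately L2-normalized spatial and temporal descriptors."""
    return np.hstack([l2_normalize(s_feats), l2_normalize(t_feats)])

def loo_nn_accuracy(feats, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    sq = np.sum(feats ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T  # squared distances
    np.fill_diagonal(dist, np.inf)   # the held-out video may not match itself
    nearest = dist.argmin(axis=1)
    return float(np.mean(labels[nearest] == labels))

# Synthetic data: 3 well-separated classes with 10 "videos" each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)
centers = rng.standard_normal((3, 16)) * 5.0
s_feats = centers[labels] + 0.1 * rng.standard_normal((30, 16))
t_feats = centers[labels] + 0.1 * rng.standard_normal((30, 16))
features = st_tcof(s_feats, t_feats)
accuracy = loo_nn_accuracy(features, labels)
```

Normalizing each descriptor before concatenation keeps either branch from dominating the distance purely because of its scale.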

4.3. Evaluation of the Spatial and Temporal TCoFs

In this subsection, we systematically evaluate the influence of using the
spatial and temporal TCoFs on the DynTex, YUPENN, and Maryland data sets.

Effectiveness of the s-TCoF. Since the s-TCoF features are constructed by
accumulating the features of all frames, it is interesting to investigate
the effect on the final performance of using different numbers of frames. To
this end, we evaluate the s-TCoF using five different settings: 1) using only
the first frame in a video, 2) using the first N/8 frames in a video, 3) using
the first N/4 frames in a video, 4) using the first N/2 frames in a video, and
5) using all N frames in a video. Experimental results are shown in Table 1.


Table 1: Evaluation of the spatial TCoF (s-TCoF) using different numbers of frames.

Datasets          1st      N/8      N/4      N/2      N
Alpha (NN)        100      100      100      100      100
Beta (NN)         98.77    99.38    99.38    98.77    99.38
Gamma (NN)        97.73    97.35    96.97    96.97    96.59
YUPENN (NN)       95.71    96.43    96.19    96.43    95.48
YUPENN (SVM)      96.90    96.90    96.90    97.14    96.90
Maryland (NN)     72.31    80.00    75.38    77.69    76.92
Maryland (SVM)    80.00    83.85    80.77    83.08    88.46


From Table 1, we can see that the spatial TCoF performs well even when using
only the first frame on the DynTex and YUPENN data sets. This confirms the
effectiveness of the spatial TCoF scheme.


Table 2: Evaluating the performance of the temporal TCoF (t-TCoF) as a function of the parameter τ.

Datasets          τ = 1    τ = 2    τ = 3    τ = 4    τ = 5
Alpha (NN)        98.33    96.67    96.67    96.67    96.67
Beta (NN)         97.53    96.91    97.53    97.53    97.53
Gamma (NN)        93.56    94.32    93.18    93.18    93.94
YUPENN (NN)       90.24    91.19    92.38    93.57    93.10
YUPENN (SVM)      94.52    96.19    96.67    96.90    97.14
Maryland (NN)     55.38    56.92    57.69    61.54    63.85
Maryland (SVM)    66.92    62.31    61.54    63.85    63.85

Effectiveness of the t-TCoF. Here, we conduct experiments to evaluate
the influence of the parameter τ. Experimental results are shown in Table 2.
We observe from Table 2 that t-TCoF is not sensitive to the choice of τ.

Comparison of s-TCoF and t-TCoF. Note that almost all the results of
s-TCoF in Table 1 outperform those of t-TCoF in Table 2. This suggests that
s-TCoF is more effective than t-TCoF, since the randomness of the micro-motions
in dynamic textures or natural dynamic scenes makes the temporal information
less critical. Nevertheless, t-TCoF can provide complementary information to
s-TCoF in some cases, as will be shown later.
4.4. Comparisons with the State-of-the-Art Methods

Dynamic Texture Classification on DynTex Data Set. We conduct
a set of experiments to compare our methods with LBP-TOP. Experimental
results are shown in Table 3.


Table 3: Classification results on the DynTex dynamic texture data set. The performance of
LBP-TOP is based on our implementation. All methods use the NN classifier.

Datasets    LBP-TOP    s-TCoF    t-TCoF    st-TCoF
Alpha       96.67      100       96.67     98.33
Beta        85.80      99.38     97.53     98.15
Gamma       84.85      95.83     93.56     98.11


We observe from Table 3 that:

1. The s-TCoF performs the best on the Alpha and Beta data subsets. This
result confirms that the s-TCoF is effective for dynamic texture
classification.

2. On the Gamma subset, s-TCoF and t-TCoF significantly outperform LBP-TOP.
Moreover, by combining s-TCoF with t-TCoF, we achieve the best
result. This result suggests that t-TCoF might provide complementary
information to s-TCoF.

Dynamic Scene Classification on YUPENN. We compare our methods with the
state-of-the-art methods, including CSO, GIST, SFA, SOE, SAE, BoSE, SK-means,
and LBP-TOP, and the experimental results are presented in Table 4 and
Table 5. We observe from Table 4 that:

1. The s-TCoF and t-TCoF both outperform the state-of-the-art methods.
Note that YUPENN consists of stabilized videos. These results confirm
that both s-TCoF and t-TCoF are effective for dynamic scene data in a
stabilized setting.

2. The combination of the s-TCoF and t-TCoF, i.e., the st-TCoF, performs
the best, reducing the error by over 30% in relative terms. As shown in
Table 5, s-TCoF and t-TCoF are complementary to each other on some
categories, e.g., “Light Storm”, “Railway”, “Snowing”, and “Wind. Farm”.
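
To make the size of the gain concrete, the relative error reduction can be worked out from the overall SVM accuracies reported in Table 5. The text does not say which baseline the 30% figure refers to, so the two comparisons below (against BoSE and against t-TCoF) are assumptions chosen for illustration:

```latex
\text{vs. BoSE:}\quad
\frac{(100-96.19)-(100-99.05)}{100-96.19} = \frac{3.81-0.95}{3.81} \approx 0.75,
\qquad
\text{vs. t-TCoF:}\quad
\frac{(100-97.86)-(100-99.05)}{100-97.86} = \frac{2.14-0.95}{2.14} \approx 0.56 .
```

Under either reading, the relative reduction exceeds 30%.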


Table 4: Classification results on the scene data set YUPENN. The results of the other methods are taken from the
corresponding papers. The performance of LBP-TOP is based on our implementation.

Methods    CSO      GIST    SFA      SAE     SOE      BoSE     SK-means    LBP-TOP    s-TCoF    t-TCoF    st-TCoF
NN         -        56      -        80.7    74       -        -           75.95      96.43     93.10     98.81
SVM        85.95    -       85.48    96.0    80.71    96.19    95.2        84.29      97.14     97.86     99.05


Table 5: Category-wise accuracy (%) for different methods on the dynamic scene data set
YUPENN. All methods use a linear SVM classifier. Our methods and LBP-TOP are based on our
implementation. The other results are taken from previously published work.

Categories     HOF+GIST    Chaos+GIST    SOE      SFA      CSO      LBP-TOP    BoSE     s-TCoF    t-TCoF    st-TCoF
Beach          87          30            93       93       100      87         100      97        97        97
Elevator       87          47            100      97       100      97         97       100       100       100
Forest Fire    63          17            67       70       83       87         93       100       97        100
Fountain       43          3             43       57       47       37         87       100       97        100
Highway        47          23            70       93       73       77         100      97        100       100
Light Storm    63          37            77       87       93       93         97       90        100       100
Ocean          97          43            100      100      90       97         100      100       100       100
Railway        83          7             80       93       93       80         100      97        100       100
Rush River     77          10            93       87       97       100        97       97        97        97
Sky-Clouds     87          47            83       93       100      93         97       100       97        100
Snowing        47          10            87       70       57       83         97       90        97        97
Street         77          17            90       97       97       93         100      100       97        100
Waterfall      47          10            63       73       77       90         83       93        93        97
Wind. Farm     53          17            83       87       93       67         100      100       100       100
Overall        68.33       22.86         80.71    85.48    85.95    84.29      96.19    97.14     97.86     99.05


Dynamic Scene Classification on Maryland. We present the comparison of our
methods with the state-of-the-art methods in Table 6. We observe from Table 6
that:

1. The s-TCoF significantly outperforms the other methods. This suggests
that the spatial information is extremely important for scene
understanding.

2. The results of t-TCoF are much worse than those of s-TCoF. This might be
due to the significant camera motions in this data set.


Table 6: Classification results on the scene data set Maryland. The results of the other methods are taken from the
corresponding papers. The performance of LBP-TOP is based on our implementation.

Methods    CSO      SFA    SOE     BoSE     LBP-TOP    s-TCoF    t-TCoF    st-TCoF
NN         -        -      -       -        31.54      74.62     58.46     74.62
SVM        67.69    60     43.1    77.69    39.23      88.46     66.15     88.46

4.5. Further Investigations and Remarks

Data Visualization. To show the discriminative power of the proposed
approach, we use t-Stochastic Neighbor Embedding (t-SNE)³ to visualize
the data distributions of the dynamic scene data sets YUPENN and Maryland.
Results are shown in Fig. 8 and Fig. 9, respectively. We observe from Fig. 8 and
Fig. 9 that s-TCoF, t-TCoF, and st-TCoF yield distinct separations between
categories. These results clearly illustrate the effectiveness of our proposed
TCoF approach.

Remarks. Note that in our TCoF schemes, we treat the frames in a video 
as orderless images and extract mid-level features with a ConvNet. By doing so, 
the sequential information among features is ignored. The superior experimental 
results suggest that such a simplification is harmless. The sequential information 


3 t-SNE is a (prize-winning) technique for dimensionality reduction that is particularly well 
suited for the visualization of high-dimensional data. 

Figure 8: Data visualization of LBP-TOP, s-TCoF, t-TCoF, and st-TCoF on the YUPENN 
dynamic scene data set. Each point in the figure corresponds to a video sequence. 

Figure 9: Data visualization of LBP-TOP, s-TCoF, t-TCoF, and st-TCoF on the Maryland 
dynamic scene data set. Each point in the figure corresponds to a video sequence. 


in these processes contributes little (or nothing) to discriminability, because 
dynamic textures can be viewed as visual processes of a group of particles 
with random motions, and dynamic scenes are places where natural events 
are occurring. The effectiveness of our proposed TCoF approach for 
dynamic texture and scene classification is due to the following aspects: 

• The rich combination of filters built on color channels in the ConvNet 
describes richer structures and color information. The filters that are built 
on different image patches capture stronger and richer structures than 
hand-crafted features do. 

• The ConvNet makes the extracted features robust to several kinds of image 
transformations, owing to its max-pooling and LCN components. Specifically, 
max-pooling makes the ConvNet robust to translations, small scale 
variations, and partial rotations, and LCN makes the ConvNet robust to 
illumination variations. 

• The first and the second order statistics capture enough information over 
the mid-level features. 

• When the video sequences are stabilized, the t-TCoF might provide 
complementary information to the s-TCoF. 
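
The orderless pooling discussed in the remarks above can be illustrated with a minimal sketch. It assumes that the second order statistic is the per-dimension variance (i.e., only the diagonal of the covariance is kept) and that each frame has already been mapped to a mid-level feature vector; the function name `tcof_descriptor` and the toy feature values are hypothetical and serve only as illustration.

```python
from statistics import fmean, pvariance


def tcof_descriptor(frame_features):
    """Pool per-frame feature vectors into one video-level descriptor.

    frame_features: list of equal-length lists, one per frame. These are a
    hypothetical stand-in for the mid-level ConvNet features of each frame.
    Returns the per-dimension means followed by the per-dimension variances.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    # Transpose so that each tuple holds one feature dimension over all frames.
    dims = list(zip(*frame_features))
    means = [fmean(d) for d in dims]          # first order statistic
    variances = [pvariance(d) for d in dims]  # assumption: diagonal of covariance
    return means + variances


# Three frames with two-dimensional toy features.
feats = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
desc = tcof_descriptor(feats)

# Reordering the frames leaves the descriptor unchanged, which is exactly
# the sense in which the sequential information is ignored.
shuffled = tcof_descriptor([feats[2], feats[0], feats[1]])
```

Because both statistics are computed over the set of frames, any permutation of the frames yields the same descriptor; a descriptor that encodes frame order would need a different pooling step.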

5. Conclusion and Discussion 


We have proposed a robust and effective feature extraction approach for 
dynamic texture and scene classification, termed Transferred ConvNet Features 
(TCoF), which is built on the first and the second order statistics of the 
mid-level features extracted by a ConvNet with knowledge transferred from the 
image domain. We have investigated two different implementations of the TCoF 
scheme, i.e., the spatial TCoF and the temporal TCoF. We have systematically 
evaluated the proposed approaches on three benchmark data sets and confirmed 
that: a) the proposed spatial TCoF is effective, and b) the temporal TCoF can 
provide complementary information when the camera is stabilized. 

Unlike the representation of an image, the representation of a video sequence 
needs to take the following aspects into account: 

1. To depict the spatial information. In most cases, we can recognize the 
dynamic textures and scenes from a single frame in a video. Thus, extracting 
the spatial information of each frame in a video might be an effective way 
to represent the dynamic textures or scenes. 

2. To capture the temporal information. In dynamic textures or scenes, there 
are some specific micro-motion patterns. Capturing these micro-motion 
patterns might help to better understand the dynamic textures or scenes. 

3. To fuse the spatial and temporal information. When the spatial and 
temporal information are complementary, combining both of them might 
boost the recognition performance. 
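
The three aspects above can be sketched as follows. The sketch assumes that frames are given as flat lists of pixel intensities, that the spatial input is each frame with a given mean frame removed, that the temporal input is the difference between adjacent frames, and that fusion is a plain concatenation of the two descriptors. The helper names and the toy values are hypothetical, and which mean frame is subtracted is left as a parameter because this section does not specify it.

```python
def spatial_inputs(frames, mean_frame):
    """Mean-removed frames: one input per frame (spatial branch)."""
    return [[p - m for p, m in zip(frame, mean_frame)] for frame in frames]


def temporal_inputs(frames):
    """Differences between adjacent frames: T frames give T - 1 inputs."""
    return [
        [cur - prev for prev, cur in zip(prev_frame, cur_frame)]
        for prev_frame, cur_frame in zip(frames, frames[1:])
    ]


def fuse(spatial_descriptor, temporal_descriptor):
    """Assumed fusion rule: concatenate the two video-level descriptors."""
    return list(spatial_descriptor) + list(temporal_descriptor)


# Three toy frames with two pixels each, and a hypothetical mean frame.
frames = [[10, 20], [12, 18], [15, 21]]
mean_frame = [12, 20]

s_in = spatial_inputs(frames, mean_frame)  # [[-2, 0], [0, -2], [3, 1]]
t_in = temporal_inputs(frames)             # [[2, -2], [3, 3]]
fused = fuse([0.1, 0.2], [0.3])            # [0.1, 0.2, 0.3]
```

With a static camera the frame differences isolate scene motion, whereas camera motion changes every pixel between frames; this is consistent with the observation that the temporal branch helps mainly on stabilized sequences.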

Unlike the motion of rigid or semi-rigid objects (e.g., in actions), the dynamics 
of textures and scenes are relatively random and non-directional. Whereas the 
temporal information might be the most important cue in action recognition [23], 
our investigation in this paper suggests that the sequential information in 
dynamic textures or scenes is not that critical for classification. 